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Abstract 

'& ' Adaptive nuclear-norm penalization is proposed for low-rank matrix approx- 

imation, by which we develop a new reduced-rank estimation method for the 
general high-dimensional multivariate regression problems. The adaptive nuclear 
CN " norm of a matrix is defined as the weighted sum of the singular values of the 

matrix. For example, the pre-specified weights may be some negative power of 



qq . the singular values of the data matrix (or its projection in regression setting). 

C*~) ' The adaptive nuclear norm is generally non-convex under the natural restriction 

that the weight decreases with the singular value. However, we show that the 
proposed non-convex penalized regression method has a global optimal solution 
| obtained from an adaptively soft-thresholded singular value decomposition. This 

new reduced-rank estimator is computationally efficient, has continuous solution 
path and possesses better bias-variance property than its classical counterpart. 
The rank consistency and prediction/estimation performance bounds of the pro- 
j_j ■ posed estimator are established under high-dimensional asymptotic regime. Sim- 



ulation studies and an application in genetics demonstrate that the proposed 
estimator has superior performance to several existing methods. The adaptive 
nuclear-norm penalization can also serve as a building block to study a broad 
class of singular value penalties. 

1 Introduction 

Given n observations of the response G 9ft 9 and predictor Xj £ 9ft p , we consider the 
multivariate linear regression model: 

Y = XC + E, (1.1) 



* Corresponding author; kunchen@ksu.edu 



1 



where Y = (y 1; y n ) T , X = (xi, x n ) T , C is a p x q coefficient matrix, and 
E = (ei, e n ) T is a random n x q matrix of independently and identically distributed 
random errors with mean zero and variance a 2 . 

We are interested in the scenario when both the number of predictors p and the 
number of responses q may depend on and even exceed the sample size n. Such high- 
dimensional regression problems are increasingly encountered in quantitative investiga- 
tions. It is well known that ordinary least squares (OLS) estimation is equivalent to 
separately regressing each response on the set of predictors, which, however, ignores 
the dependence structure of the multivariate response and may be infeasible in high- 
dimensional settings. The curse of dimensionality may be mitigated by assuming the 
coefficient matrix C admit some low-dimensional structure and employing the regular- 
ization/penalization approach for model estimation. For example, for Gaussian data, 
it is appropriate to conduct model estimation by penalized least squares (PLS): 

lj(C)+V x (C), (1.2) 

where J7(C) = ||Y— XC|||i is the sum of squared error with ||-||f denoting the Frobenius 
norm, V\(-) is some penalty function measuring the "size" (complexity) of the enclosed 
matrix, and A is a non-negative tuning parameter controlling the degree of penalization. 

Within this general framework, an important model is reduced-rank regression 
(RRR) (Anderson, 1951, 1999, 2002; Izenman, 1975; Reinsel & Velu, 1998), in which 
dimension reduction is achieved by assuming that the coefficient matrix C is of low 
rank, i.e., its rank r(C) = r* < min(p, q). The classical RRR literature mainly 
focuses on small p cases and maximum likelihood estimation. Bunea et al. (2011) 
proposed the rank selection criterion (RSC) for high dimensional settings, and re- 
vealed that rank constrained estimation can be viewed as a PLS method (1.2) with 
the penalty proportional to the rank of the coefficient matrix. This Zo-type penalty 
can be alternatively cast as a penalty in terms of the number of non-zero singu- 
lar values of C, i.e., V X (C) = Ar(C) = A£ fc I(d fc (C) ^ 0) where /(■) is the indi- 
cator function, which results in an estimator obtained by hard-thresholded singular 
value decomposition (SVD), see Section 2. Yuan et al. (2007) proposed a nuclear- 
norm penalized least squares estimator (NNP), in which the nuclear norm penalty 
is defined as V\(C) = A||C||* = A^ fe c4(C), where || • ||* denotes the nuclear norm 
or the sum of the singular values of the enclosed matrix. This Zi-type penalty en- 
courages sparsity among the singular values and achieves simultaneous rank reduc- 
tion and shrinkage coefficient estimation (Negahban & Wainwright, 2011; Bunea et al., 
2011; Lu et al., 2012). Rohde & Tsybakov (2011) investigated the theoretical proper- 
ties of the Schatten-g quasi-norm penalty, which is defined as V\(C) = AJ^ =1 c^(C) 
for < q < 1, and nonasymptotic bounds of prediction risk were obtained. Sev- 
eral other methods or theoretical development related to reduced-rank regression ex- 
ist, see, e.g., Aldrin (2000), Negahban & Wainwright (2011), Mukherjee & Zhu (2011), 
and Chen et al. (2012). The reduced-rank metrology has connections with many pop- 
ular tools including principal component analysis and canonical correlation analysis, 
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and it is extensively studied in matrix completion problems (Candes & Recht, 2009; 
Candes et al., 2011; Koltchinskii et al., 2011). 

It is evident that the aforementioned reduced-rank approaches are closely related 
to the SVD method in matrix approximation (Eckart & Young, 1936; Reinsel Sz Velu, 
1998). It is also intriguing that the rank and nuclear norm penalized methods can be 
viewed as l and l\ penalized methods in the SVD domain, respectively. (In fact, the 
/ 2 norm of the singular values is equivalent to a ridge penalty.) Motivated by these 
connections, we propose the adaptive nuclear norm regularization method, in which 
the adaptive nuclear norm (ANN) of a matrix C is defined as a weighted sum of its 
singular values; see Xu (2009) for a similar idea related to reweighted penalization in 
the context of matrix completion. Clearly, the key is to close the gap between the 
nonsmooth Iq rank penalty and the l\ nuclear penalty while keeping the computation 
stable and efficient for high dimensional data. We show that the convexity of ANN 
depends on the ordering of the weights. Particularly, the ANN turns out to be non- 
convex for the case that the weight decreases with the singular value, a condition needed 
for a meaningful regularization, see Section 2 for further discussion. Despite the non- 
convexity, we are able to characterize the explicit global optimal solution for the ANN 
penalized matrix approximation problems. 

Based on ANN, we develop a new method of simultaneous dimension reduction and 
coefficient estimation for the general high- dimensional multivariate regression. Our 
proposal is based on two main ideas. Firstly, by penalizing the singular values adap- 
tively, our method builds a bridge between the RSC and NNP methods, and it may 
be viewed as analogous to the adaptive Lasso (Tibshirani, 1996; Zou, 2006) developed 
for univariate regression. Secondly, the ANN penalty is applied to XC rather than C 
(Koltchinskii et al., 2011); although the criterion remains non-convex, this setup allows 
the reduced-rank estimation to be solved explicitly and efficiently. Comparing to the 
computationally intensive NNP method which tends to overestimate the rank, the pro- 
posed ANN method may improve rank determination with the aid of some well-chosen 
adaptive weights. Comparing to the RSC method, the smooth ANN penalty results 
in a continuous solution path and allows more flexible bias and variance tradeoff in 
model fitting. The rank consistency and prediction/estimation performance bounds 
of the proposed estimator are established under high- dimensional asymptotic regime. 
We discuss the incorporation of an extra I2 penalization for improving reduced-rank 
estimation. Empirical studies demonstrate that the proposed methods enjoy superior 
performance in both prediction and rank estimation as compared to several existing 
methods. 

2 Adaptive nuclear norm penalty 

To study the general properties of the adaptive nuclear norm, we consider the low- 
rank matrix approximation problem, Y = C + E, which is a special case of model 
(1.1) when X becomes an identity matrix and n = p. In many applications, given the 
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noisy data matrix Y, it is of interest to seek its low-rank approximation for denoising, 
which can be achieved by various methods, e.g., rank penalization or nuclear norm 
penalization. These methods are closely related to the SVD method in low-rank matrix 
approximation, which motivated our study. 
Consider the SVD of Y G 3? nX9 , 

Y = UDV T , D = Diag{d(C)} = Diag{d}, (2.1) 

where U and V are respectively px h and qxh orthonormal matrices with h = minQo, q), 
and the vector of singular values d = (di, dh) consists of non- increasing non- negative 
singular values of Y. For any A > 0, define the hard SVD-thresholding operator (HSVT) 

Hx(Y) = U^ A (D)V T , H X (D) = Diag{d i J(d i > X),i = l,...,h}, (2.2) 

where /(•) is the indicator function, and the soft SVD-thresholding operator (SSVT) 

S X (Y) = U<S A (D)V T , 5 A (D) = Diag{(^ - A)+, % = 1, h}, (2.3) 

where x + is the non- negative part of x, namely, x + = max(0, x). It is well-known that 
in the matrix approximation problems (X = I), an HSVT estimator solves (1.2) with 
the /o-type rank penalization (Eckart & Young, 1936), while a SSVT estimator solves 
(1.2) with the Zi-type nuclear-norm penalization (Cai et al., 2010); these results are 
summarized in the proposition below. 

Proposition 2.1 For any A > and Ye W ixq , the HSVT operator H\(Y) defined by 
(2.2) can be characterized as: U\{Y) = argmin c {|| Y— C\\% + A 2 r( C)}, and the SSVT 
operator S\(Y) in (2.3) is similarly characterized as: S\(Y) = argminc7{||| Y— C\\p + 

x\\c\U}. 

The hard-thresholding operator 1~L\(Y) eliminates any singular values below a thresh- 
old value A, while the soft-thresholding operator S\(Y) shrinks all the singular values 
by the same amount A towards zero. These two SVD operators are natural extensions 
of the hard/soft-thresholding rules for scalars and vectors (Donoho & Johnstone, 1995; 
Cai et al., 2010). Generally, estimators based on hard-thresholding often have small 
bias but may suffer from large variance; in contrast, soft-thresholding reduces variance 
by introducing extra bias in the estimators, which may be preferable in cases when data 
are noisy and highly correlated (Donoho & Johnstone, 1995). 

The connections between different thresholding rules and penalty terms motivated 
us to consider the adaptive nuclear norm penalization (ANN). The main idea is to build 
a bridge between the Iq and l\ penalties or the HSVT and SSVT rules so as to fine-tune 
the bias-variance tradeoff in the SVD domain. We define the adaptive nuclear norm of 
a matrix C 6 F «: 

h 

/(C)H|C|U = ]Tu^(C), (2.4) 

1=1 
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where h = min(p, q), di(-) is the ith largest singular value of the enclosed matrix, and 
the WiS are the non-negative weights. 

Since the nuclear norm is convex and is a matrix norm, a natural question arises 
as to whether or not its weighted extension ANN preserves the convexity, which is the 
case for lasso and adaptive lasso penalties in the vector case (Zou, 2006). However, 
our analysis shows that the convexity of the ANN depends on the ordering of the non- 
negative weights. The following theorem gives a necessary and sufficient condition of 
its convexity. 

Theorem 2.2 For any matrix C G W xq (n = p, h = mm(p,q)), let f(C) = \\0\\* w 
defined in (2.4)- Then /(■) is convex if and only if W\ > u>2 > • • ■ > Wh > 0. 

Hence, for the ANN to be a convex function, the weight must increase with the sin- 
gular value, i.e., they are co- monotone. However, to use ANN for penalized estimation, 
the opposite is desirable, i.e., we would and shall henceforth impose the following order 
constraint: 

< w 1 < ■ ■ ■ < w h , (2.5) 

in order for larger singular values to receive lesser penalty to help reducing the bias 
and smaller singular values to receive heavier penalty to help promoting sparsity. Here 
is an example showing that the ANN is neither convex nor concave under constraint 
(2.5). Consider n = p = q = 2, and 

Cl = (o i)' C2= (o °)" 

Let Wi = 1 and W2 = 2. It can be verified that /(Ci) = /(C2) = /(— C2) = 4, 
while /((d + C 2 )/2) = 4.5 > (/(d) + /(d))/2; also, /((d - d)/2) = 1.5 < 
(/(d) + /(-d))/2. 

The non-convexity of the ANN arises from the constraint (2.5) that the weight 
decreases with the singular value. In fact, the ANN is then no longer a matrix norm. 
However, we are able to explicitly solve and characterize the global solution of the ANN 
criterion as follows. 

Theorem 2.3 For any X > 0, Ye K nx,? and < w 1 < • • • < w h (n = p, h = 
min(n, q) ), a global optimal solution to the optimization problem 

mm /(C) := l\\\Y- C||| + A^^(C)|. (2.6) 

is given by the adaptive SVD soft-thresholding (ASVT) operator C := S\ w { Y), 

S Xw ( Y) = US Xw (D) V*, S Xw (D) = Diag{(d 4 - \ Wi )+, i = 1, h}, (2.7) 
Further, if Y has a unique SVD, C is the unique optimal solution. 



5 



The fact that a closed-form global minimizer can be found for the non-convex ANN 
problem is not immediately clear and rather surprising. The result stems from the von 
Neumann's trace inequality (Mirsky, 1975) and the properties of SVD, see the Appendix 
for details. Following Zou (2006), the weights can be set as some power of the singular 
values of the data matrix, i.e., w = {d(Y)}~ 7 , where 7 > is a prespecified constant. 
In this way, the order constraint (2.5) is automatically satisfied. A more general way 
of constructing the weights and its relation to other penalty forms will be discussed in 
Section 7. 



3 Adaptive nuclear norm penalization in regression 

We now consider the general problem of estimating the coefficient matrix C, which 
is possibly of low-rank, in the multivariate linear regression model (1.1). Below, let 
P = X(X T X)~X T be the projection matrix onto the column space of X and C = 
(X T X)~X T Y the LS estimator of C, where (X T X)~ denotes the Moore-Penrose inverse 
of the enclosed Gram matrix. Unless otherwise noted, the singular values are always 
placed in non-increasing order. 

3.1 Rank and nuclear norm penalized regression methods 

The fundamental results in Theorem 2.3 about rank and nuclear norm penalization 
for matrix approximation can be readily extended to the general regression setting. 
Consider first the rank penalized least squares criterion (Bunea et al., 2011), 

^||Y-XC||| + Ar(C). (3.1) 
Based on Proposition 2.1, it can be easily shown that the minimizer of (3.1), denoted 

~ (A) ~ ~ ~ 2 ~ T 

as C , can be obtained by hard-thresholding the SVD of XC. Let VD V be the 
eigenvalue decomposition of Y T PY = (XC) T XC. The SVD of XC is then given by 
UDV T , where U = PYVD 1 = XCVD" 1 . Therefore, 

XC W = ?WXC) = XCVD"^^(D)V T , C W = CVD^^^(D)V T . 

(3.2) 

This rank selection criterion (RSC) proposed by Bunea et al. (2011) is valid in high- 
dimensional settings and hence extends the classical rank-constrained RRR approach 
(Reinsel & Velu, 1998). In fact, the set of rank-constrained estimators which minimize 
|| Y — XC|||i subject to r(C) = r (r = 1, min(p, q)), spans the solution path of (3.1). 
The nuclear-norm penalized least squares criterion (NNP) (Yuan et al., 2007) 

J||Y-XC||| + A||C||„ (3.3) 
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does not have an explicit solution in general, and can be computationally intensive 
for large-scale data. Extensive research has been devoted to its optimization problem, 
e.g., Cai et al. (2010), Toh & Yun (2010), etc. One popular algorithm is to iteratively 
conducting a majorization step of the objective function and a minimization step using 
soft SVD-thresholding Cai et al. (2010). 

The performance of the RSC and NNP estimators is related to the bias-variance 
trade-off phenomenon discussed in Section 2. The NNP may be more accurate than the 
RSC when the correlation among predictors is high or the signal to noise ratio (SNR) 
is low, while RSC may perform better when the correlation is moderate and the SNR is 
not too low; see Section 6 for details. A drawback of NNP is that it is computationally 
intensive and is generally not as parsimonious as RSC in rank determination. These 
motivated our study of the ANN for building a continuum of estimators between the 
RSC and NNP estimators. 

3.2 Adaptive nuclear norm penalized regression method 

Predictive accuracy and computation efficiency are both pivotal in high dimensional 
regression problems. Motivated by criteria (3.1) and (3.3) and their connections with 
SVD, we propose to estimate C by minimizing the non-convex PLS criterion 



where h = min(p, q) and the weights {wi} are required to be non-negative and in non- 
decreasing order. In practice, a foremost task of using ANN is setting proper adaptive 
weights. Following Zou (2006), a natural way to construct the weights is based on the 
LS solution: 



where PY is the projection of Y onto the column space of X and 7 a non-negative 
constant. 

The proposed criterion (3.4) is built on two main ideas. Firstly, the criterion directly 
focuses on prediction matrix approximation and encourages sparsity among the singular 
values of XC rather than those of C, which may yield low-rank solutions for XC and 
hence for C. A prominent advantage of this setup is that the problem can then be 
solved explicitly and efficiently Secondly, the adaptive penalization of the singular 
values allows flexible bias- variance tradeoff: large singular values receive small amount 
of penalization to control the possible bias, and small singular values receive large 
amount of penalization to induce sparsity and hence reduce the rank. The following 
Corollary shows that this criterion leads to an explicit ANN estimator. 

Corollary 3.1 A minimizer of (3.4), denoted as tf , is obtained via adaptively soft- 
thresholding the SVD of XC where C is the LS estimator of C, i.e., 



XC iXw) = S Xw {XC) = US Xw {D) V, d Xw) = CVD l S Xw {D) V. (3.6) 




(3.4) 



i=l 



W = 



{d(PY)}~ 7 = d 



(3.5) 



7 



where UD is the SVD of XC as defined in the previous section. 

By Pythagoras' theorem, minimizing the criterion (3.4) is equivalent to minimizing 
{l/2||XC-XC||£, + AX)i'Mi(XC)} with respect to C, where C is the OLS estimator. 
The above result then directly follows from Theorem 2.3. The proposed method first 
projects Y onto the column space of X, i.e., PY = XC, and the ANN estimator 
is then obtained as a low-rank approximation of PY via soft SVD-thresolding; the 
thresholding level is adaptive and can be data-driven: the smaller a singular value, 
the larger its thresholding level. Therefore, the estimated rank of an ANN estimator 
corresponds to the smallest singular value of PY that exceeds its thresholding level, 
i.e., r = max{r : d r (PY) > Xw r }. For the choice of the weights (3.5), i.e., the estimated 
rank is given by 

f = max{r : d r (PY) > A^i}, (3.7) 

and the plausible range of the tuning parameter is A e [0, dj +1 ] , with A = correspond- 
ing to the LS solution and A = d\ +1 the null solution. 

The ANN estimator and the RSC estimator only differ in their singular values but 
the difference can be consequential. While the solution path of RSC is discontinuous 
and the number of possible solutions equals to the maximum rank, the ANN criterion 
offers more flexibility in that the resulting solution path is continuous and guided by 
the data-driven weights. The ANN and RSC are based on the same one-time SVD 
operation and thus they have similar computation complexity and can both be easily 
implemented and efficiently computed, in contrast to the computationally intensive 
NNP method. 

- (Aw) 

For any fixed A > 0, the ANN estimator C can be computed by (3.6). (The same 
SVD operation can be used to compute the RSC solutions.) To choose an optimal A 
and hence an optimal ANN solution, we use the K-io\& cross validation (CV) method, 
based on the predictive performance of the models (Stone, 1974). For the numerical 
studies reported below, we first compute the solutions over a grid of 100 A values equally 
spaced on the log scale and select the best A value by CV; subsequently we refine the 
selection process around the chosen A value with another finer grid of 100 A values. 

4 Rank consistency and error bounds 

We study the rank estimation and prediction properties of the proposed ANN estimator. 
Our theoretical analysis is built on the framework developed by Bunea et al. (2011), 
as RSC and ANN are closely connected. We mainly focus on the random weights 
constructed in (3.5), in line with the adaptive Lasso method (Zou, 2006) developed for 
the univariate (multiple) regression. Similar results are obtained for any prespecified 
sequence of weights satisfying certain order restriction and boundedness requirements. 
All the proofs are given in the Appendix. 
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The rank of the coefficient matrix C can be viewed as the number of effective 
combination of predictors linked to the responses. Rank determination is always a 
foremost task of reduced-rank estimation. The quality of rank estimator, defined by 
(3.7), clearly depends on the signal to noise ratio. Following Bunea et al. (2011), we 
shall use the smallest non-zero singular value of XC, i.e., cZ r *(XC), to measure the 
signal strength, and use the largest singular value of the projected noise matrix PE, 
i.e., <ii(PE), to measure that of the noise. Intuitively, if di(PE) is much larger than 
the size of the signal, some signal could be deeply masked by the noise and lost during 
the thresholding procedure; as such, f may be much smaller than the true rank. The 
lemma below characterizes the "limit" or the true target of f and its relationship with 
the noise level. 

Lemma 4.1 Suppose that there exists an index s < r* such that d s {XC) > (1 + 
5)\ l '^ + V and d s+1 (XC) < (1 - 5)X 1 /^ +1 ^ for some 5 G (0,1]. Then P(f = s) > 
1 — P(di(PE) > SX 1 ^ 1 ^), where P is the projection matrix onto the column space of 
X, E is the error matrix in model (1-1), and 7 is the power parameter in the adaptive 
weights (3.5). 

This result establishes the relationship between the estimated rank, the signal level, 
the noise level and the adaptive weights. To achieve consistent rank estimation, we 
consider the following assumptions: 

Assumption The error matrix E has independent iV(0, a 2 ) entries. 

Assumption For any 9 > 0, assume A = {(1 + 8)a( y /r x ~ + yfq)/o~} 1+l with 5 defined 
in Lemma 4.1, and assume <i r *(XC) > 2X 1 ^" /+1 \ 

Assumption 1 is about the error structure, which ensures that the noise level di(PE) 
is of order ^Jr^ + ^/q, see Lemma .1 (Bunea et al., 2011). Assumption 2 concerns 
the signal strength relative to the noise level and the appropriate rate of the tuning 
parameter. 

Theorem 4.2 Suppose Assumptions 1-2 hold. Let r* = r(C) be the true rank, r x = 
r{X) be the rank of X, and r be the estimated rank defined by (3. 7). Then P(r = r*) — > 1 
as r x + q — > 00 . 

Theorem 4.2 shows that the ANN estimator is able to identify the correct rank with 
probability tending to 1 as r x + q goes to infinity. Similar to Bunea et al. (2011), the 
consistency results can be extended to the case of sub-Gaussian errors and can also 
be easily adapted to the case when r x + q is bounded and the sample size n goes to 
infinity. Therefore, the rank consistency of the proposed ANN estimator is valid for 
both classical and high-dimensional asymptotic regimes. 

Our main results about the prediction performance of the proposed ANN estimator 

- (Aw) 

are presented in Theorem 4.3 below. For simplify, we write C for C 
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Theorem 4.3 Suppose Assumptions 1-2 hold. Let c = d\(XC) / d r *{XC) > 1. Then 

\\XC-XC\\l < ^^\\XB-XC\\ 2 F +— - — -{V25 + 2(2-5r 1 - (2c + 5)^Y \^ r* , 
1 — a a(l — a) I J 

with probability greater than 1 — exp(— 8 2 (r x + q)/2), for any < a < 1 and any p x q 
matrix B with r(B) < r* . Moreover, taking B = C and a = 1/2 yields 

\\XC- XCf F < A{V25 + 2(2 - 5Y 1 - (2c + 5)~^} 2 X^r* 

= 4{V2 + 2(2 - 5)-y5 - (2c + 5yy5} 2 {l + 9) V(V^ + v^) V 

u>zi/i probability greater than 1 — exp(— 6* 2 (r x + q)/2). 

The above established bound shows that the prediction error is bounded by (ij(PE)r* 
up to some constant with probability 1 — exp(— 9 2 (r x + q)/2), i.e., the smaller the error 
size or the true rank, the smaller the prediction error. The bound is valid for any 
X and C. The estimation error bound of C can also be readily derived from Theo- 
rem 4.3, e.g., if d rx (X.) > p > for some constant p, then under Assumptions 1-2, 
||C - C||| < Ap- 2 {V25 + 2(2 - (J)-t - (2c + 5)~y 2 \^r*. 

The rank consistency and prediction bound can be similarly established for any 
prespecified sequence of weights satisfying 

< W\ < ■ ■ ■ < Wf, w r * < M < w r * + i, (4.1) 

where < M < oo, w r * + i > and f = min(r x ,q); the index s in Lemma 1, the 
requirements on the tuning sequence and the signal level in Assumption 2 shall be 
modified accordingly, 

d s (XC) > (1 + 5)\w s , i + i(XC) < (1 - S)\w s+1 for some 5 E (0, 1], (4.2) 
A = (l + 0) ( r( v ^;+ v ^)( ( 5M)" 1 ,4*(XC) >2XM. (4.3) 



Corollary 4.4 Suppose that Assumption 1 and (4-l)-(4-3) hold. Then 

(1) P(f = r*) — > 1 as r x + q — )■ oo; 

(2) \\XC- XC\\ 2 < l±H\\XB - XC\\% + — - — -{(2 + V25)M - wA 2 X 2 r* 

1 — a a(l — a) 

for any < a < 1 and B with r(B) < r*; 

(3) \\XC- XCf F < A(V2 + 2/5 - Wl / (M5)) 2 (l + e) 2 a 2 (yr x + Vq) 2 r* 

with probability greater than 1 — exp(— 6 2 (r x + q)/2). 
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The proof is similar to that of Theorems 4.2 and 4.3 and hence is omitted. 

The error bounds of the ANN estimator established in Theorem 4.3 and Corol- 
lary 4.4 are comparable to those of the RSC and NNP estimators (Bunea et al., 2011; 
Rohde & Tsybakov, 2011). The rate of convergence is (r x + q)r*, which is the optimal 
minimax rate for rank sparsity under suitable regularity conditions (Rohde & Tsybakov, 
2011; Bunea et al., 2012). However, the bounds for NNP was obtained with some extra 
restrictions on the design matrix, and its tuning sequence that achieves the smallest 
mean squared error (MSE) usually does not lead to correct rank recovery (Bunea et al., 
2011). While both RSC and ANN are able to achieve correct rank recovery and mini- 
mal MSE simultaneously, the latter possess continuous solution path with data-driven 
tuning which may lead to improved empirical performance. 



5 Robustification of the reduced-rank estimation 

As suggested by a referee and motivated by Mukherjee & Zhu (2011), we discuss the 
robustification of the reduced-rank methods by incorporating extra li penalty in the 
penalized criteria. 

Mukherjee & Zhu (2011) proposed the robust reduced-rank ridge (RoRR) method 
which performs I2 penalized ridge regression under rank constraint. The shrinkage 
estimation induced by the I2 penalty makes the reduced rank estimation robust and 
especially suitable when the predictors are highly correlated. The method can be viewed 
as the following PLS criterion that was also mentioned in Bunea et al. (2011), 

\\\Y- XC\\% + X 1 r(C) + h 2 dl(C), (5-1) 

i=l 

where h = min(p, q), ^jdj(C) = tr(C T C), and Ai and A2 are tuning parameters. The 
problem can be solved via data augmentation. Specifically, let 

rW/V x- I x 



Opxg/ \yX2IpXp 

then (5.1) can be written as an RSC criterion 1/2|| V* — X*C\\p+\ir(C), whose solution 
is given by (3.2); see Mukherjee & Zhu (2011). 

Similarly, the proposed ANN method can also be robustified by incorporating a 
ridge penalty term. Similar to (3.4), for efficient computation, we impose an I2 penalty 
on XC rather than C, 

h h 

-\\Y - XC\\ 2 F + XiJ2 Widi(XC) + -A 2 d-(XC). (5.2) 

i=l i=l 

Interestingly, this robustified ANN criterion (RoANN) is analogous to the adaptive 
elastic net criterion (Zou & Hastie, 2005) in univariate regression. (The case of imposing 
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one or both penalties on C directly is more complex and will be pursued elsewhere.) It 
can be easily verified that, for fixed tuning parameters, the objective function (5.2) is 
minimized at 

Q = - (j(^ lW ) 
1 + A 2 

where C^ Xlw ^ denotes the ANN estimator in the absence of the Z 2 penalty. Indeed, the 
extra 1% penalty induces overall shrinkage of the ANN estimator. 

For each fixed A2, the RoRR method requires inverting a p x p matrix of the form 
(X T X + A2I) and performing an SVD of a q x q matrix. When p is much bigger than n, 
the Woodbury matrix identity is useful in speeding up computation (Hager, 1989), i.e., 
(X T X + A 2 ir 1 = 1 /A 2 I - 1 /A^X T (I + 1 /X 2 KX T )- 1 X. On the other hand, the RoANN 
method only requires one-time matrix inversion and SVD operation for obtaining the 
whole solution path, thereby saving computation. We shall examine the effects of an 
additional l 2 penalization by simulation in Section 6. Due to space limit, relevant 
theoretical analysis will be pursued elsewhere. 

6 Empirical studies 
6.1 Simulation 

We compare the prediction, estimation and rank determination performances of the 
NNP estimator proposed by Yuan et al. (2007), the RSC estimator proposed by Bunea et al. 
(2011), the RoRR estimator proposed by Mukherjee & Zhu (2011), and our proposed 
ANN and RoANN estimator. In ANN estimation the adaptive weights are constructed 
as (3.5) with 7 = 0, 1, 2 and we denote the resulting estimator as ANN 7 (7 = means 
unweighted ANN). In the numerical results reported below, we use the accelerated 
proximal gradient algorithm implemented in Matlab by Toh & Yun (2010) for NNP es- 
timation. R code for RoRR was provided by their original authors (Mukherjee & Zhu, 
2011), and we modified their code to make use of the Woodbury matrix identity (Hager, 
1989) for saving computation. We have also implemented all the other methods in R 
(R Development Core Team, 2008). All computation was done on linux machines with 
3.4 GHz CPU and 8 GB RAM. 

We consider the same simulation models as in Bunea et al. (2011). The coefficient 
matrix C is constructed as C = 6C C[, where b > 0, C G 9? pxr *, Ci G 9^ xr * and 
all entries in Co and Ci are i.i.d. N(0,1). Two scenarios of model dimensions are 
considered, i.e., p,q < n and p,q > n. 

• Model I (n — 100, p = q = 25, r* = 10): The covariate matrix X is constructed 
by generating its n rows as i.i.d. samples from a multivariate normal distribution 
MVN(0, T), where Y = (Tij)p X p and r - = p li ~ jl with some < p < 1. 

• Model II (n = 20, p = q = 25, r* = 5, r x = 10): The covariate matrix X is 
generated as X = XoP 1 / 2 , where T is defined as above, X = XiX 2 , Xi G 3l nxrx , 
X 2 g ^ xp , and all entries of Xi, X 2 are i.i.d N(0, 1). 
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The data matrix Y is then generated by Y = XC + E, where the elements of E are 
i.i.d. samples from N(0, 1). It can be seen that each simulated model is characterized 
by the following parameters : n (sample size), p (number of predictors), q (number 
of responses), r* (rank of C), r x (rank of X), p (design correlation), and b (signal 
strength). The experiment was replicated 500 times for each parameter setting. 

To alleviate the influence of the inaccuracy in empirical tuning parameter selection 
and to reveal the true potential of each penalized method for fair comparison, one 
way is to tune each method based on its prediction accuracy evaluated on a very large 
independently generated validation data set; this yields "optimally tuned" estimators 
denoted as NNP {0) , RSC {0) , ANN^), etc. The 10-fold cross validation method is also 

used with the actual data, which results in the estimators NNP (C) , RSC (C) , ANN (C,) , etc. 
(Although to save space, we only report the CV results of RSC and ANN 2 .) For each 
method, the model accuracy is measured by the average of the scaled mean-squared- 
errors (SMSE) from all runs, i.e., SMSE = 100 ||C — C\\ 2 F /{pq) for estimation (Est), and 
SMSE = 100||XC — XC||fi/(ng) for prediction (Pred). Their standard errors are also 
reported. To evaluate the rank determination performance, we report (1) the average of 
the estimated ranks from all runs, and (2) the percentage of correct rank identification. 
For each method, the average computation time per replication is reported. 

Tables 1 and 2 report the simulation results for Model I and II, respectively. We 
summarize our findings as follows. 

• We first examine the performance of ANN 7 with different 7 values. The ANN , 
which does not use the adaptive weights, is not as accurate as the ANN with 
adaptive weights in both rank estimation and prediction in general. Its behavior 
is similar to that of the NNP: they both tend to overestimate the rank but may 
perform well when the signal is weak and the model dimension is not high (Model 
I). The performance of both ANNx and ANN 2 is substantially better than ANN 
with the aid of adaptive weights. We have also experimented with other 7 values, 
and as expected, the ANN estimator behave more and more similar to the RSC 
as 7 increases. Our results show that 7 = 2 is generally a very good choice. 
Henceforth we always refer to ANN 2 in the following comparisons with other 
reduced-rank methods. 

• The ANN 2 generally outperforms the RSC in both estimation and prediction. 
The improvement can be substantial, especially when the signal is weak or mod- 
erate and the correlation among the predictors is high. For rank determination, 
both the ANN and the RSC have similar excellent behavior when the signal is 
moderate to strong and the correlation among the predictors is weak to moder- 
ate. We notice that the ANN estimator tends to slightly overestimate the rank; 
however, the overestimation is generally negligible. When the signal strength is 
moderate to large and correlation among predictors is weak to moderate ANN 
does slightly worse in terms of rank selection than RSC yet is able to maintain a 
small gain in estimation and prediction. This gain is due to the shrinkage effect 
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Table 1: Comparison of estimation, prediction and rank determination performances 
of various reduced-rank estimators using Model I (n — 100, p = 25, q = 25, r* = 10). 
The superscript ^ stands for optimal tuning, and stands for cross validation. The 
estimation error (Est) and prediction error (Pred) are reported along with their standard 
errors in the parenthese. For rank estimation (Rank), the average of estimated rank 
and the percentage of correct rank identification are reported. The simulation is based 
on 500 replications, and the average running time of each replication is reported in 
seconds (Time). 
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Table 2: Comparison of estimation, prediction and rank determination performances of 
various reduced-rank estimators using Model II (n = 20, p = 100, q = 25, r* = 5,r x = 
10). The layout of the table is the same as in Table 1. 
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using soft-thresholding. While RSC is based on hard-thresolding and keeps the 
first few leading SVD layers completely as signal, in reality all the SVD layers are 
contaminated by noise, and hence the shrinkage estimation is effective. When the 
signal is weak and the correction among the predictors is high, the rank determi- 
nation performance of the RSC can be much worse than that of the ANN, which 
is also reflected in their differences in prediction and estimation. 

• The ANN 2 also outperforms the NNP in general, and is more parsimonious than 
the NNP in both rank reduction and computation. Only when the signal is weak 
and the correlation among predictors is very high, may the NNP method slightly 
outperform the ANN (and hence RSC) in estimation and prediction. However, 
this gain exacts a high cost that may make it not worthwhile, as the NNP often 
excessively overestimates the rank and is much harder to compute. Our findings 
regarding the NNP agrees with those reported in Bunea et al. (2011). Note that 
Bunea et al. (2011) also proposed a calibrated NNP method for rank determina- 
tion, which showed very similar behavior as that of the RSC. (Hence it is not 
reported here.) 

• Adding an Z 2 penalty usually boosts the performance of the reduced-rank meth- 
ods. The RoRR may substantially outperform its non-robustified counterpart 
RSC, especially when the correlation is high, showing the power of shrinkage 
estimation. The RoANN 2 only slightly outperforms ANN 2 , because ANN 2 itself 
performs adaptive shrinkage estimation and thus the gain from the overall shrink- 
age induced by the Z 2 penalty is much limited. RoRR, ANN 2 and RoANN 2 have 
comparable performance in most cases and are the best methods. 

• NNP is much more computationally expensive than the other reduced-rank meth- 
ods. Both RSC and ANN are very fast to compute, and in practice they can 
always be computed together as they rely on the same SVD operation. Adding 
an I2 penalty increases computation time. RoRR takes longer computation time 
than RoANN, for the reason explained in Section 5. 

Overall, the ANN approach is preferable to both the RSC and the NNP methods, 
especially for cases when the data are noisy and the correlation among the predictors 
is high. Adding extra l 2 penalty is certainly worthwhile, especially for RSC, but it may 
incur a lot of computation efforts. 

6.2 An application in Genomics 

We consider a breast cancer data set (Witten et al., 2009; Bunea et al., 2010), which 
consists of the gene expression measurements and the comparative genomic hybridiza- 
tion (CGH) measurements for n = 89 subjects. The data set is available in the R 
package PMA, and a detailed description can be found in Chin et al. (2006). 
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Table 3: Performances of reduced-rank estimators on breast cancer data. Setting 1 
regresses GEPs on CNVs, and setting 2 regresses CNVs on GEPs. 







Method 






NNP C RSC C RoR c ANN^ RoANN^ 




Setting 1: 


Clirornosom6 14 (^q 641, p 


— 76*1 


CVE 


0.84 


1.00 0.85 0.96 


0.92 


Rank 


4 


7 1 


1 


Time 


2190.4 


5.5 689.0 24.6 


38.2 




Setting 2: 


Chromosome 14 (q = 76, p = 


= 641) 


CVE 


0.69 


0.59 0.59 0.58 


0.58 


Rank 


23 


5 5 11 


17 


Time 


2158.5 


13.5 186.7 13.8 


16.8 




Setting 1: 


Chromosome 21 (q = 227, p 


= 44) 


CVE 


0.84 


0.95 0.83 0.87 


0.86 


Rank 


3 


12 1 


1 


Time 


1266.4 


0.5 66.2 2.6 


8.4 




Setting 2: 


Chromosome 21 (q = 44, p - 


= 227) 


CVE 


0.69 


0.65 0.63 0.62 


0.62 


Rank 


5 


1 1 1 


2 


Time 


1226.0 


0.5 28.6 0.7 


2.9 



Prior studies have demonstrated a link between DNA copy-number changes and 
cancer risk (Pollack et al., 2002; Peng et al., 2010). It is thus of interest to examine the 
relationship between DNA copy number variations (CNVs) and gene expression profiles 
(GEPs), for which multivariate regression methods can be useful. Biologically, it makes 
sense to regress GEPs on CNVs since the latter play an important role in regulating the 
former. The reverse approach of regressing CNVs on GEPs is also meaningful, in that 
the resulting predictive model may identify functionally relevant CNAs; this approach 
has been shown to be promising in enhancing the limited CGH analysis with the wealth 
of GEP data (Geng et al., 2011; Zhou et al., 2012). We thus try both approaches, i.e., 
setting 1: designate the CNVs on the CGH spots of a chromosome as predictors (X nxp ), 
and the GEPs of the same chromosome as responses (Y nX(? ), and setting 2: reverse the 
roles of X and Y. Both the responses and predictors are centered and standardized. 

We focus on chromosomes 14 and 21, which were previously studied in Bunea et al. 
(2010). In this study, p and q are either comparable or much larger than n = 89. To 
alleviate the high- dimensionality problem, the reduced-rank methods are appealing for 
identifying a few linear combinations of predictors for optimally predicting the response 
variables. We perform model estimation using various reduced-rank methods, with the 
tuning parameters selected by ten-fold cross validation. (We obtained similar results 
with five-fold and fifteen-fold cross validations.) The cross validation error rate (CVE) 
(averaged over the number of response variables and the sample size) is then used 
to compare the predictive performance of different penalization schemes. Note that 
CVE = 1 corresponds to a null model with zero coefficient matrix. Table 3 reports the 
CVE, the estimated rank and the computation time. 

In setting 1, the SNR is very low as reflected by the CVEs being close to 1. The 
RSC may fail to pick up any signal (for chromosome 14), while the other methods, 
especially NNP and RoRR, perform better owing to the power of shrinkage estimation. 
In setting 2, the SNR is relatively higher, and the ANN and RoANN have better 
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prediction performance than all the other methods. In Section 4, we have shown that 
the prediction error is of the order (r x + q)r*, where r x < min(n,p) is the rank of the 
design matrix X. This may partly explain why in setting 1 the prediction is always 
poorer than in setting 2, because q is much larger than p when the CNVs (GEPs) serve 
as predictors (responses). For rank estimation, the NNP method always yields higher 
rank estimates than the other methods; the ANN estimated rank is higher than that of 
RSC for chormosome 14 in setting 2, otherwise their rank estimates are similar. Adding 
I2 penalty improves the predictive performance of the reduced-rank methods, especially 
the improvement from RSC to RoRR can be substantial. The robustified methods may 
also yield higher rank estimates. As borne out by the simulation study, such behaviors 
result from the hybridization between reduced-rank methods and ridge regression. Both 
NNP and RoRR are computationally intensive for large datasets, while RSC, ANN and 
RoANN are much faster to compute. Overall, it can be seen that the proposed ANN 
approach shows better performance than the RSC, with not much extra computational 
cost. 



7 Discussion 

There are several potential directions for future research. We have mainly considered 
the adaptive nuclear-norm penalization on XC which yields a computationally efficient 
SVD-thresholding estimator for dimension reduction and shrinkage estimation. It is 
interesting to consider an ANN criterion that puts the adaptive nuclear-norm penal- 
ization directly on the coefficient matrix C, i.e., {||Y — XC||^ + A Widi(C)}. This 
criterion may be optimized by similar iterative SVD-thresholding method used in solv- 
ing the NNP problem. Although the computation will be more intensive, the advantage 
is that this criterion would result in simultaneous adaptive rank reduction and shrinkage 
estimation. 

The proposed ANN method can serve as the building block to study a family of 
singular value penalties. This is based on the connection between an adaptive Zi-type 
penalty and many concave penalty functions such as SCAD (Fan & Li, 2001) and bridge 
penalty (Knight & Fu, 2000). Consider the general regression problem (1.2) with a 
general singular value penalty V\(C) = Yli=iP^(di), where p x (-) is a penalty function, 
e.g., l q bridge penalty p\{\di\) = X\di\ q (0 < q < 1) (Rohde & Tsybakov, 2011). In this 
setup the optimization of (1.2) can be challenging. A promising approach is to adopt a 
local linear approximation (Zou & Li, 2008) , p\( \di\) ~ p\ ( | df^ | ) +p' x ( | df' \ ) (| di\ — \ d[ ^ | ) , 
for di « af \ where df^ is some initial estimator of di which for example can be obtained 
by the LS method. It can be seen that for fixed df \ up to a constant, the first-order 
approximated penalty admits exactly an ANN form. This suggests the ANN estimator 
with the weights p'\{\df\) can be viewed as an one-step estimator of these problems, 
and these problems may be solved by an iteratively reweighted ANN approach. 

It is shown that incorporating an extra ridge penalty can induce further shrink- 
age and hence improve the reduced-rank estimation. When combined with the ANN 
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penalty, such a criterion bears resemblance to the elastic-net criterion (Enet) (Zou & Hastie, 
2005) in univariate regression. It would be interesting to investigate the properties of 
the SVD-Enet approaches. Another pressing problem concerns further extending the 
regularized reduced-rank regression methods to generalized linear models and nonpara- 
metric regression models (Yee & Hastie, 2003; Li & Chan, 2007). On the optimization 
aspect, it is interesting to study the usage of ANN in some classical sparse optimization 
areas, such as matrix completion (Candes et al., 2011). 



Technical details 



Proof of Theorem 2.2 



Proof First we show by a counter example that if we have an index k such that 
Wk < iffc+i) then /(■) is non-convex. Let C and D be diagonal p x q matrices such that 
cu = i, for 2 = 1, h, while D equals C but with entries switched at positions h — k + 1 
and h — k on the diagonal. It is then easy to verify that 



f(C) = f(p)=jT l w i (h-i+l), 



i=i 



/(^y^) " ^ " ^ =(h~k + 0.5)K + w k+l ) -(h-k+ l)w k -(h- k)w k+1 

=0.5(w k + w k+ i) -w k >0, 

where /(•) is defined in (2.4). Therefore /(•) is non-convex. 

Next we prove that for w\ > ■ ■ ■ > Wh > 0, /(■) = || ■ is a convex function. First 
consider the case that Wh > 0, and define the following function on 9ft h : 



i=l 



where 5 is a permutation of {l,...,h} determined by x such that Ix^^ > |x| 5 ^ > 
■ ' ' > l x la(/i)) where |x| is the vector of absolute values of x. We claim that w(-) is a 
symmetric gauge function (see Horn & Johnson (1985, Definition 7.4.23) for reference), 
i.e., it satisfies the following six conditions: (a) w(x) > 0, Vx; (b) iw(x) = if and only 
if x = 0; (c) w(ax) = |a|u>(x), Va 6 9ft; (d) u>(x + y) < io(x) + w(y); (e) iw(x) = w;(|x|); 
(f) iw(x) = iu(r(x)) for any r is a permutation of indices {1, h}. 

All conditions except (d) are trivial to verify. Now we prove (d). Let S,a,T be per- 
mutations such that ||x + \, |l x l CT (i)| an d {|ylr(i)l are P^ ace( i i R non-increasing 
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order respectively. 

h h 

y ) = ^Wi\x + y\ S Q<^{wi\x\f ({) +Wi\y\s (i) } 



m;(x 

i=l 
h 

1=1 



n 

{ Wi l x '-w + Wi l y lrw} = ^( x ) + w (y)- 



where the second inequality is due to the Hardy-Littlewood-Polya inequality (Hardy et al., 
1967). 

Then by a straightforward application of Horn & Johnson (1985, Theorem 7.4.24), 
since ||C|| TO * = w([di(C),d 2 (C), c4(C)] T ), || • \\ ws „ defines a matrix norm and hence is 
a convex function. 

For the case that Wh = 0, let s to be the largest index such that w s > 0. For 

< e < w s , consider the perturbated w that Wi = Wi, for i — 1, s, and Wi = e, for 

1 = s + l,...,h. Then for any A, B G 9^, H^IU* < M^ + M^. By taking e ^ 0, 
HnplL. < + ^ £i - Therefore is convex. 

Proof of Theorem 2.3 

Proof We first prove that C is indeed a global optimal solution to (2.6). Since the 
penalty term only depends on the singular values of C, by letting g = {gi}^ =1 = d(C) 
(which implies the entries of g are in non-increasing order), (2.6) can be equivalently 
written as: 

h 




S-9l>—>9h>0 1 cgr x « [2 

d(C) = g 



For the inner minimization, we have the inequality 

||Y-C||| = tr(Y-C)(Y-C) T 

= tr(YY T ) -2tr(YC T ) +tr(CC T ) 



h 

2 



= ^(Y)-2tr(YC T ) + ^ 

i=l i=l 
h h 

> E^( Y )- 2d ( Y ) T g+E^ 



i=l i=l 

The last inequality is due to von Neumann's trace inequality. See Mirsky (1975) for 
a proof. The equality holds when C admits the singular value decomposition C = 
UDiag(g)V T , where U and V are defined in (2.1) as the left and right orthonormal 
matrices in the SVD of Y. Then the optimization is reduced to 

,A» {t & ~ lMY) ~ Al *- + s d?(Y) ) } ' ( ' 2) 

20 



The objective function is completely separable and takes minimum when gi = (di(Y) — 
\wi)+. This is a feasible solution because {di(Y)} is in non-increasing order, while {wi} 
is in non-decreasing order. Therefore C = S\ W (Y) = UDiag{(d(Y) — Aw) + }V T is a 
global optimal solution to (2.6). The uniqueness follows by the equality condition for 
von Neumann's trace inequality when Y has a unique SVD, and the uniqueness of the 
strictly convex optimization (.2). This concludes the proof. 

Proof of Theorem 4.2 

of Lemma 4.1 By (3.7), f > s d s+1 (PY) > A^t an d f < s -<=>- cZ s (PY) < 
i 

X-1+ 1 . Then 

P(f ^ s) = P{d s+ i(PY) > A^r or rf s (PY) < A^r}. 

Based on the Weyl's inequalities on singular values (Franklin, 2000) and observing 
that PY = XC + PE, we have di(PE) > d s+1 (PY) - 4+i(XC) and di(PE) > 
d s (KC) - d s (PY). Hence i +1 (PY) > A^ implies di(PE) > A^r - d s+1 (XC), and 
d»(PY) < A^r implies di(PE) > d s (XC) - A^i. It then follows that 

P(f ^ s ) < PR(PE) > min(A^r _ d s+1 (X.C), d s (XC) - A^r)}. 

Finally, note that min(A^r _ d s+1 (XC) , d s (XC) - A^tt) > 5\~. This completes the 
proof. 

Lemma .1 (Lemma 3 of Bunea et al. (2011)) Letr x = rank(X) and suppose As- 
sumption 2 holds. Then for any t > 0, E[di(PE)} < a(- s /r^ + yfq), and P{di(PE) > 
Eld^PE)} + at} < exp(-t 2 /2). 

of Theorem 4.2 When d r *(XC) > 2A^tt, we have 

d r *(X.C) > 2A^r > (1 + 5)A^r, and d r .+i(XC) = < (1 - 8)X^, 

for some < 5 < 1. It can be seen that the effective rank s defined in Lemma 4.1 equals 
to the true rank, i.e., s = r*, and min(A~ - d r * + i(XC), d r *(XC) - A~) > 5\~. It 
then follows by using the properties of Gaussian errors presented in Lemma .1 that 

P(f = r *) > 1 - P(di(PE) > 5A^r) 

= 1 - PR(PE) > (1 + 9)a(^F x + y/q)} 
> i _ exp(-£ 2 (r x + q)/2) -> 1. 
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Proof of Theorem 4.3 

Proof By the definitions of C in (3.6), 

||Y-XC|| F + 2A^tMi(XC) < ||Y - XB||J. + 2A ^to^(XB), 

for any p x q matrix B. Note that 

||Y-XC|| F = ||Y-XC||| + ||XC -XC|| F + 2 < E,XC - XC > F , 
|| Y - XB|| F = || Y - XC||| + ||XB - XC|| F + 2 < E, XC — XB > F . 

Then we have 

||XC-XC|| F 

<||XB - XC||| + 2 < E,XC - XB > F +2A{^iu i d i (XB) - ^ Wi di(XC)} 

<||XB - XC||| + 2 < PE,XC - XB > F +2A{^tMi(XB) -^^^(XC)} (.3) 

<||XB - XC||| + 2di(PE)||XC - XB||, + 2A{^w^(XB) -^^(XC)} 

<||XB - XC|| F + 2d 1 (PE)y / r(XC - XB)||XC - XB|| F + 2A{^w i d i (XB) - J^d^XC)}. 
Now consider any B with r(B) < f , 

^2w i d i (XB)-^2w i d i (XC) 

f f f f 

=Wr ^ rf i( XB ) - W r Yl d i( X C) + Y( W r ~ ™*K(XC) - ^(iVf ~ Wi)di(XB). 

i=l i=l i=l i=l 

By the definition of the adaptive weights in (3.5), i.e., Wi = d~ 7 (PY), we know that 
Wf — W\ > ■ ■ ■ > Wf — Wf—i > 0. Therefore, both pi(-) = Yll=i d i(') an d Pz(') = 
YH=i( w f ~ w i)di(-) satisfy the triangular inequality; see the proof of Theorem 2.2. 
Moreover, based on Weyl's inequalities (Franklin, 2000) and PY = XC+PE, d f (PY) > 
d f (XC) - di(PE) and di(PY) < di(XC) + di(PE). It follows that 

^2w i d i (XB)-^2w i d i (XC) 

r f 

< W f Y d i( X C ~ XB ) + ^2( w f - ™iK(XC ~ XB ) 

i=l t=l 

r 

< {2dp(PY) - d-\PY)} ^di(XC - XB) 

i=l 

f 

< {2(d f {XC) - c/i(PE))" 7 - (di(XC) + d 1 (PE))^} d i( x ^ - XB) 

i=l 

< {2(d f {XC) - di(PE))^ - (di(XC) + d 1 {PE))^} V?\\XC - XB\\ F . 
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The last inequality is due to the Cauchy-Schwarz inequality. Using (.3), r(XC-XB) < 
r(C — B) < 2f and the inequality 2xy < x 2 /a + ay 2 we have 

||XC - XC||| <||XB - XC||| + a||XC - XB||| 

+ - |di(PE)\/2? + 2A(rf f .(XC) - ^(PE))" 7 ^ - A(di(XC) + ^(PE))"^}' 

Since ||XC-XB|||< ||XC - XC||| + ||XB - XC|||, consequently, for any < a < 1, 

||XC - XCII 2 <i^||XB - XC\\ 2 F 
1 — a 

+ jz 1 r { v^di(PE) + 2A(^(XC) - di(PE))" 7 - A(di(XC) + di(PE))^V 



r. 



As shown in Theorem 4.2, on the event {di(PE) < 5X~t +1 }, the estimated rank f equals 
to the true rank r*, i.e., f = r*, and P{di(PE) > 5A^+i} < exp(— 8 2 (r x + g)/2). Also, 
d r *(XC) > 2A~ and c = di(XC)/d r *(XC) > 1. Therefore, with probability at least 
l-exp(-0 2 (r* + ?)/2), 

IIXC - XCII 2 <^^||XB - XCII 2 + — — r (V25X^ + 2X(2 - 5)^X^ - X(2c + 5)^X^Y r* 



F - -i 

1 — a 



XCII 2 + — - — -\V25X^ +2X(2-5)^X^ - A(2c+ <5)" 7 A^ V 
a(l — a) I J 

<i±5||XB - XC||£ + - 1 . \V25 + 2(2 - 5)^ - (2c + 5)~A 2 X^r*. 
1 — a a(l — a) l ) 

Since B is an arbitrary matrix with r(B) < r*, the second part of the theorem is 
obtained by taking B = C and a = 1/2. This completes the proof. 

Proof of Corollary 4.4 
Proof Consider any B with r(B) < f, we have 

^iMi(XB) -^^(XC) 
=w f .^2di(XB) - Wf^diiXC) + J2( w r - Wi)di(XC) - ^(vjf - w^d^XB). 

i=l i=l i=l i=l 

Note that w? — w\ > ■ ■ ■ > w? — Wf-i > 0, so both p±(-) = ^i=i^*(') anc ^ Pz(') = 
Y^i=i( w f ~ Wi)di(-) satisfy the triangular inequality. See the proof of Theorem 2.2. It 
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then follows that 

f f 
J2widi(XB) - ^2widi{XC) <Wr d i( X C - XB ) + J2( Wf ~ w i) d i( x C - XB ) 



i=l 



8=1 



J2( 2w ? - ^H( xc - xb ) 



i=l 



J2( 2w ? - ^) 2 II XC - XB|| F . 



The last inequality is due to the Cauchy-Schwarz inequality. 
Using (.3) and the inequality 2xy < x 2 /a + ay 2 we have 

||XC -xcni 



<||XB - XC||> + - < di(PE)i/r(XC - XB) + A 



^^(2^f - w t ) 2 ) +a||XC-XB||^ 



<(1 + a)||XB - XC||> + - { rf 1 (PE) A /r(XC - XB) + A, 



^2{2wr-Wi) 2 \ +a||XC-XC| 



Consequently, for any < a < 1, 



IXC - XCIIi <^^||XB - XC||| + 
1 — a 



all — a) 



d 1 (PE)V2f + X A 



Wf — Wi 



As shown in Theorem 4.2, on the event {di(PE) < 5AM}, the estimated rank f equals 
to the true rank r*, i.e., f = r*, and P{di(PE) > 5XM} < exp(-9 2 (r x + q)/2). 
Therefore, with probability at least 1 — exp(— 9 2 {r x + q)/2), 

( I "\ 2 

:||2 p <l±^||xB-XC|| 2 ,+ 



< 



1 — a a(l — a) 

1 + a, 



\5MV2r* + A, 



r" 



, ^(2w r * - Wi)' 
\ i=i 



,XB - XC||| + ./ , AV{(2 + V25)M - Wl } 2 . 
1 — a all — a) 



Since B is an arbitrary matrix with r(B) < f, the second part of the theorem is obtained 
by taking B = C and a = 1/2. This completes the proof. 
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